This procedure uses the DESeq2 package to analyze RNA-seq read count data, or similar integer matrix, with a multi-factor design. Therefore, it allows for testing differential expression between two conditions while there are confounding variables. It takes an integer matrix of gene-level read counts, normalizes it across samples, compare their gene expression profiles, and run DESeq2 to test differential expression.

 

Go to project home

1 Description

1.1 Project

Genetic Modifiers in Trisomy 21 Leukemogenesis.

1.2 Data

Number of unique RNA-seq read pairs mapped to known human genes, using data from re-run of the same samples.

1.3 Analysis

Trisomy vs. euploid 21.

2 Analysis and results

  • Total number of samples: 18
  • Total number of genes: 20483
  • Number of genes after filtering: 20410
  • Minimum number of total read:
  • Comparison:
    • Condition: Ploid
    • Group0: Trisomy21 (N0=9: S41, S42, S43, S50, S51, S52, S53, S54, S55)
    • Group1: Euploid (N1=9: S38, S39, S40, S44, S45, S46, S47, S48, S49)

2.1 Read count summary

The read count matrix includes 18 samples and 20410 genes after removing genes with total read counts few than . The global average of read count per gene is 3840.8 and the overall percentage of non-zero read counts is 93.88%.

Table 1. Summary statistics of read counts per sample: quantiles of read count per gene, average read count per gene, and percentage of non-zero read counts.

Ploid Clone Min. 1st Qu. Median Mean 3rd Qu. Max. Total Positive%
S41 Trisomy21 TMD100 0 132.00 2396.5 13514.96 11403.75 4580700 275840330 94.81
S42 Trisomy21 TMD100 0 31.00 581.5 3443.78 2881.75 1063841 70287530 95.50
S43 Trisomy21 TMD100 0 26.00 472.0 2900.79 2395.00 995817 59205187 94.48
S50 Trisomy21 TMD145 0 23.00 439.0 2692.84 2218.50 729128 54960958 94.72
S51 Trisomy21 TMD145 0 29.00 520.5 3171.65 2694.00 845270 64733306 95.67
S52 Trisomy21 TMD145 0 16.00 282.0 1615.42 1398.00 613385 32970685 93.28
S53 Trisomy21 TMD145 0 30.00 516.0 2808.84 2461.00 855515 57328377 94.18
S54 Trisomy21 TMD145 0 16.00 313.0 1820.08 1535.00 471903 37147874 93.34
S55 Trisomy21 TMD145 0 17.25 332.0 1944.37 1645.00 529318 39684685 92.69
S38 Euploid TMD100 0 12.00 235.0 1265.51 1084.00 362660 25829059 92.87
S39 Euploid TMD100 0 172.00 3189.0 19446.47 16531.50 5338934 396902392 94.13
S40 Euploid TMD100 0 28.00 530.5 3090.19 2617.00 910925 63070763 94.56
S44 Euploid TMD145 0 21.00 403.0 2523.69 2136.00 654094 51508424 93.05
S45 Euploid TMD145 0 15.00 277.0 1492.69 1297.00 372070 30465716 93.61
S46 Euploid TMD145 0 22.00 409.0 2312.95 2009.00 568334 47207280 95.25
S47 Euploid TMD145 0 11.00 229.0 1467.83 1210.75 399776 29958415 92.54
S48 Euploid TMD145 0 21.00 380.0 2504.53 2097.00 584805 51117526 94.75
S49 Euploid TMD145 0 7.00 161.0 1117.76 940.00 353517 22813438 90.49

2.2 Normalization

2.2.1 Size factor

DESeq2 uses the median ratio to estimate a size (normalization) effect for each sample in an RNA-seq data set. Size effects are positively correlated to total read counts of samples. Therefore, a simple normalization method is to divide all read counts of each sample by its size factor.

The size factors of all samples range between 0.394 and 7.343 (geometric mean = 1.004). In general, we expect a positive correlation between total read count of a sample and its size factor, and their geometric mean is close to 1.0.

Figure 1. Left: correlation between median read count per gene and size factor across all samples. Right: the distribution of size factors by groups. Average size effects of the 2 groups are 1.42 vs. 1.47 (p=0.9499).

2.2.2 Before vs. after normalization

After dividing the original read counts of each sample by its size factor, normalized read counts are log2-transformed (log2(c+1)). Data before and after normalization is compared in terms of their correlation, distribution, and variance across samples.

Figure 2. Gene-level read count of two samples with the smallest (S49) and the largest (S39) size factors.

Figure 3. Distribution of read counts before and after normalization with size factors. Read counts were log2-transformed.

2.2.3 Data distribution

Figure 4. Distribution of averaged gene-level read counts after normalization by size factors. The averages often have a bi-modal (two-peak) distribution. The narrow left peak corresponds to inactive genes with lower read counts and the wider right peak corresponds to active genes expressed at different levels.

2.2.4 Rlog data transformation

Regularized log (Rlog) transformation is another funtion implemented by the DESeq2 package to transform the original data to reduce systemic bias between RNA-seq libraries. It does 3 things:

  • Normalize original read counts by library size
  • Transform the read count data to log2 scale
  • Minimize difference between libraries for rows/genes with small counts (variance stablization)

The Rlog-transformed data has more “linear” distribution and will be used for some of the following steps, such as hierarchical clustering and PCA.

Figure 5. Data distribution after Rlog-transformation: average of all samples (left) and individual samples (right).

Figure 6. MA-plot shows the relationship between average read count and variance across samples. In the original data, genes with lower read counts have more noise and then higher variance across samples. The default DESeq2 normalization rescales the data with library size factors and the dependence remains (left). The Rlog transfromation applies a variance stablization step to penalize genes with lower read counts, so the dependence is gone (right).

2.3 Sample analysis

Analysis of samples by comparing their global gene expression profiles, potentially to identify outlier samples and confounding variables.

2.3.1 Hierarchical clustering

Figure 7. Unsupervised hierarchical clustering of samples based on their pairwise correlation, using Rlog-transformed data of all genes. The lowest common node of two samples on the clustering tree indicate their similarity (lower = more similar). Colors in the heatmap correspond to correlation coefficient of a sample pair.

2.3.2 Principal components analysis

Figure 8. Principal Components Analysis (PCA) is also an unsupervised analysis that converts a large number of correlated variables (genes) into a smaller set of uncorrelated variables called principal components (PCs). Each principal component accounts for certain percentage of total variance in a data set and the PCs are ordered by their percentages. This figure plots the top two PCs. Generally speaking, samples closer to each other on the 2-D space have more similar profiles of gene expression.

2.4 Differential gene expression

The analysis of differential gene expression reports a statistical table with 6 columns:

  • Column 1/2 - Group means: Group averages of normalized read counts (Trisomy21 vs. Euploid).
  • Column 3 - Log2FC: Log2 fold change of group means (Log2(Mean1/Mean0))
  • Column 4 - AdjFC: Log2 fold change adjusted by shrinkage esimator to penalize genes with low read counts.
  • Column 5 - Pvalue: Statistical significance of group difference, calculated by a negative binomial (Wald) test.
  • Column 6 - FDR: Adjusted p value known as false discovery rate, calculated by the Benjamini & Hochberg method.

2.4.1 DEG summary

Figure 9. Visualizations that summarize the differential expression of all 20410 genes.

  • M-A Plot This plot visualizes the relationship between average read counts on X-axis (A) and Log2 fold change on Y-axis (M). Each dot represents a gene. The red line is the LOWESS local smoothing fit, which indicates whether there is an overall pattern of skewed differential expression across the spectrum of gene expression levels.
  • Volcano Plot This plot puts together two types of differential expression statistics: Log2 fold change (size effect) and P value (statistical significance). Log2 fold change indicates magnitude of differential expression and p value indicates the consistency of change across multiple samples. Selection of differentially expressed genes usually needs to take both into account. Each dot represents a gene and its size is proportional to its distance from [0, 1]. The horizontal line corresponds to p=0.01 and the vertical lines correspond to 2.0 fold increase/decrease.
  • P Value Distribution This plot shows the counts of genes whose DESeq2 p values falling into each 1% interval. The first bar on the left then indicates the number of genes with p<0.01. When the data is completely random with enough samples, the number of genes in each 1% interval will be approximately the same (uniform distribution). The genes with significant p values would be considered as false positives if this is the case. A distribution skewed to the left suggests there are indeed differentially expressed genes (true positives).
  • False discovery rate This plots traces the number of selected genes with each FDR (false discovery rate) cutoff.

Table 2. Number of top DEGs selected via different FDR cutoffs. FDRs are calculated using the Benjamini & Hochberg method (Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300. 1995). A FDR of 0.15 to 0.25 (15% to 25% false positives) is acceptable for most follow-up analyses, while validation of differential expression via other technologies, such as qPCR, is highly recommended.

FDR Higher_in_Euploid Lower_in_Euploid Total
0.01 88 130 218
0.02 114 174 288
0.05 189 359 548
0.10 317 595 912
0.15 450 817 1267
0.20 587 1000 1587
0.25 779 1256 2035

2.4.2 Top genes

Figure 10. The top-ranked gene with the most up- (left) and down- (right) regulated expression in Euploid. Plotting data was normalized by Rlog, and then adjusted for Clone.

Figure 11. Heatmap of the top 100 and 100 DEGs with the most up- (red) and down- (yellow) regulated expression in Euploid. Each row corresponds a DEG and the colors represent its replative expression levels across samples. Samples are clustered by the genes and the columns are colored according to sample groups (blue=Trisomy21 and red=Euploid).

4 Appendix

Check out the RoCA home page for more information.

4.1 Fold change vs. log2(fold change)

The terms to represent differential expression can be used quite confusingly. In this report, fold change refers the ratio of two group means in their unlogged form. So a fold change of 2.0 means the average of the second group is increased to twice of the average of the first group; similarly, a fold change of 0.5 means the average is reduced to half. Log2(fold change) equals to the log2-transformation of the fold change. The table below gives a few examples of the conversion of these 2 variables. Log2(fold change) is more suitable for statistical analysis since it is symmetric around 0.

Supplemental Table 1. Fold Change vs. Log(Fold Change) vs. Percentage Change

Fold change Log2(fold change) Percentage change (%)
0.125 -3.000 -87.500
0.250 -2.000 -75.000
0.500 -1.000 -50.000
0.667 -0.585 -33.333
0.800 -0.322 -20.000
1.000 0.000 0.000
1.250 0.322 25.000
1.500 0.585 50.000
2.000 1.000 100.000
4.000 2.000 300.000
8.000 3.000 700.000

4.2 References

  • R: R Development Core Team, 2011. R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0. Home page.
  • Bioconductor: Gentleman RC et al., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. Home page.
  • DESeq2: Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
  • RoCA
  • Awsomics

4.3 Reproduce this report

To reproduce this report:

  1. Find the data analysis template you want to use and an example of its pairing YAML file here and download the YAML example to your working directory

  2. To generate a new report using your own input data and parameter, edit the following items in the YAML file:

    • output : where you want to put the output files
    • home : the URL if you have a home page for your project
    • analyst : your name
    • description : background information about your project, analysis, etc.
    • input : where are your input data, read instruction for preparing them
    • parameter : parameters for this analysis; read instruction about how to prepare input data
  3. Run the code below within R Console or RStudio, preferablly with a new R session:

if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }

CreateReport(filename.yaml);  # filename.yaml is the YAML file you just downloaded and edited for your analysis

If there is no complaint, go to the output folder and open the index.html file to view report.

4.4 Session information

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] xlsx_0.6.1                  edgeR_3.24.1               
##  [3] limma_3.38.3                sva_3.30.1                 
##  [5] genefilter_1.64.0           mgcv_1.8-26                
##  [7] nlme_3.1-137                DEGandMore_0.0.0.9000      
##  [9] snow_0.4-3                  DESeq2_1.22.1              
## [11] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
## [13] BiocParallel_1.16.2         matrixStats_0.54.0         
## [15] Biobase_2.42.0              GenomicRanges_1.34.0       
## [17] GenomeInfoDb_1.18.1         IRanges_2.16.0             
## [19] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [21] awsomics_0.0.0.9000         rchive_0.0.0.9000          
## [23] colorspace_1.4-1            gplots_3.0.1               
## [25] MASS_7.3-51.1               htmlwidgets_1.5.1          
## [27] DT_0.15                     yaml_2.2.1                 
## [29] kableExtra_0.9.0            knitr_1.20                 
## [31] rmarkdown_1.10              RoCA_0.0.0.9000            
## [33] RCurl_1.95-4.11             bitops_1.0-6               
## [35] devtools_2.3.1              usethis_1.6.1              
## 
## loaded via a namespace (and not attached):
##  [1] ellipsis_0.3.0         rprojroot_1.3-2        htmlTable_1.12        
##  [4] XVector_0.22.0         base64enc_0.1-3        fs_1.3.2              
##  [7] rstudioapi_0.11        remotes_2.2.0          bit64_0.9-7           
## [10] AnnotationDbi_1.44.0   fansi_0.4.1            xml2_1.2.0            
## [13] splines_3.5.1          geneplotter_1.60.0     pkgload_1.0.2         
## [16] jsonlite_1.6.1         Formula_1.2-3          rJava_0.9-11          
## [19] annotate_1.60.0        cluster_2.0.7-1        readr_1.3.1           
## [22] compiler_3.5.1         httr_1.4.2             backports_1.1.5       
## [25] assertthat_0.2.1       Matrix_1.2-15          cli_2.0.2             
## [28] acepack_1.4.1          htmltools_0.4.0        prettyunits_1.1.1     
## [31] tools_3.5.1            gtable_0.3.0           glue_1.3.2            
## [34] GenomeInfoDbData_1.2.0 dplyr_0.8.5            Rcpp_1.0.5            
## [37] vctrs_0.2.4            gdata_2.18.0           crosstalk_1.1.0.1     
## [40] stringr_1.3.1          xlsxjars_0.6.1         ps_1.3.2              
## [43] testthat_2.3.2         rvest_0.3.2            lifecycle_0.2.0       
## [46] gtools_3.8.1           XML_3.98-1.16          zlibbioc_1.28.0       
## [49] scales_1.1.1           hms_0.4.2              RColorBrewer_1.1-2    
## [52] memoise_1.1.0          gridExtra_2.3          ggplot2_3.3.2         
## [55] rpart_4.1-13           latticeExtra_0.6-28    stringi_1.2.4         
## [58] RSQLite_2.1.1          highr_0.7              desc_1.2.0            
## [61] checkmate_1.8.5        caTools_1.17.1.1       pkgbuild_1.1.0        
## [64] rlang_0.4.5            pkgconfig_2.0.3        evaluate_0.14         
## [67] lattice_0.20-38        purrr_0.3.3            bit_1.1-14            
## [70] processx_3.4.2         tidyselect_1.0.0       magrittr_1.5          
## [73] R6_2.4.1               Hmisc_4.1-1            DBI_1.0.0             
## [76] pillar_1.4.6           foreign_0.8-71         withr_2.2.0           
## [79] survival_2.43-3        nnet_7.3-12            tibble_2.1.3          
## [82] crayon_1.3.4           KernSmooth_2.23-15     locfit_1.5-9.1        
## [85] grid_3.5.1             data.table_1.12.8      blob_1.2.1            
## [88] callr_3.4.3            digest_0.6.25          xtable_1.8-3          
## [91] munsell_0.5.0          viridisLite_0.3.0      sessioninfo_1.1.1

END OF DOCUMENT